Lecture’s Plan

  1. Feed-forward neural net
  2. Recurrent neural net
    1. SRN
    2. LSTM
    3. GRU
    4. Bi-LSTM

What is Deep Learning (DL) ?

A subfield of machine learning focused on learning representations of data. Exceptionally effective at learning patterns.

Deep learning algorithms attempt to learn (multiple levels of) representation by using a hierarchy of multiple layers

If you provide the system with large amounts of data, it begins to understand it and respond in useful ways.

Multi-Layer Feed-Forward Networks

  • Multi-layer networks can represent arbitrary functions, but an effective learning algorithm for such networks was thought to be difficult.
  • A typical multi-layer network consists of an input, hidden and output layer, each fully connected to the next, with activation feeding forward.

  • The weights determine the function computed. Given an arbitrary number of hidden units, any Boolean function can be computed with a single hidden layer.

Neural Network Intro

\[h = \sigma(W_1x + b_1)\] \[y = \sigma(W_2h + b_2)\]




4 + 2 = 6 neurons (not counting inputs)

[3 x 4] + [4 x 2] = 20 weights

4 + 2 = 6 biases

26 learnable parameters
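The parameter counts above can be checked with a minimal numpy sketch (the shapes match the 3-input, 4-hidden, 2-output network; the random initialization is an illustrative assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# 3 inputs -> 4 hidden -> 2 outputs, matching the counts above
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # 12 weights + 4 biases
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # 8 weights + 2 biases

x = rng.normal(size=3)
h = sigmoid(W1 @ x + b1)   # hidden activations
y = sigmoid(W2 @ h + b2)   # outputs

n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)  # 26 learnable parameters
```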


Training

Gradient Descent

  • Define objective to minimize error: \[E(W) = \frac{1}{2}\sum_{d \in D} \sum_{k \in K} (t_{kd} - o_{kd})^2\]

    where \(D\) is the set of training examples, \(K\) is the set of output units, and \(t_{kd}\) and \(o_{kd}\) are, respectively, the teacher and current output for unit \(k\) on example \(d\).

  • The derivative of a sigmoid unit with respect to net input is: \[\frac{\partial{o_j}}{\partial{net_j}} = o_j(1-o_j)\]

  • Learning rule: change weights to minimize error: \[\Delta w_{ji} = - \eta \frac{\partial{E}}{\partial{w_{ji}}}\]

Backpropagation Learning Rule

  • Each weight changed by: \[\Delta w_{ji} = \eta \delta_j o_i \] \[\delta_j = o_j(1 - o_j)(t_j - o_j) \ \ \ \text{if } j \text{ is an output unit}\] \[\delta_j = o_j(1 - o_j)\sum_k{\delta_k w_{kj}} \ \ \ \text{if } j \text{ is a hidden unit}\]

    where \(\eta\) is a constant called the learning rate

    \(t_j\) is the correct teacher output for unit \(j\)

    \(\delta_j\) is the error measure for unit \(j\)

Gradient Descent

objective/cost function \(J(\theta)\)

Update each element of \(\theta\):

\[\theta^{new}_j = \theta^{old}_j - \alpha \frac{\partial}{\partial \theta^{old}_j} J(\theta)\]

Matrix notation for all parameters (\(\alpha\): learning rate):

\[\theta^{new} = \theta^{old} - \alpha \nabla_{\theta}J(\theta)\]

Recursively apply the chain rule through each node
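The update rule can be sketched in a few lines, assuming a simple convex objective \(J(\theta) = \theta^\top\theta\) chosen purely for illustration:

```python
import numpy as np

def J(theta):
    return float(theta @ theta)            # simple convex objective

def grad_J(theta):
    return 2 * theta                       # its gradient

theta = np.array([3.0, -2.0])
alpha = 0.1                                # learning rate
for _ in range(100):
    theta = theta - alpha * grad_J(theta)  # theta_new = theta_old - alpha * grad
print(J(theta))                            # shrinks toward the minimum at 0
```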


Error Backpropagation

  • First calculate error of output units and use this to change the top layer of weights.

Current output: \(o_j=0.2\)

Correct output: \(t_j=1.0\)

Error \(\delta_j = o_j(1-o_j)(t_j-o_j)\)

\(0.2(1-0.2)(1.0-0.2)=0.128\)

Update weights into \(j\)

\[\Delta w_{ji} = \eta \delta_j o_i\]
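The worked example above, as a few lines of Python (the learning rate \(\eta\) and the incoming activation \(o_i\) are assumed values for illustration):

```python
eta = 1.0      # learning rate (assumed for illustration)
o_j = 0.2      # current output
t_j = 1.0      # correct (teacher) output
o_i = 0.5      # assumed activation of a unit feeding into j

delta_j = o_j * (1 - o_j) * (t_j - o_j)   # output-unit error: 0.128
dw_ji = eta * delta_j * o_i               # weight update into j

print(delta_j, dw_ji)
```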

Error Backpropagation

  • Next calculate error for hidden units based on errors on the output units it feeds into.

\[\delta_j = o_j(1-o_j)\sum_k{\delta_kw_{kj}}\]

Error Backpropagation

  • Finally update bottom layer of weights based on errors calculated for hidden units.

\[\delta_j = o_j(1-o_j)\sum_k{\delta_kw_{kj}}\]

Update weights into \(j\)

\[\Delta w_{ji} = \eta \delta_j o_i\]

Backpropagation Training Algorithm

 

  • Create the 3-layer network with \(H\) hidden units with full connectivity between layers. Set weights to small random real values.

  • Until all training examples produce the correct value (within \(\epsilon\)), or mean squared error ceases to decrease, or other termination criteria:

    • Begin epoch

    • For each training example, \(d\), do:

      • Calculate network output for \(d\)’s input values
      • Compute error between current output and correct output for \(d\)
      • Update weights by backpropagating error and using learning rule
    • End epoch
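The algorithm above can be sketched end to end on a toy task; XOR, the hidden-unit count, seed, and learning rate are all illustrative assumptions, not values from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    h = sigmoid(X @ W1 + b1)        # hidden layer
    return sigmoid(h @ W2 + b2)     # output layer

# XOR: a tiny task that needs a hidden layer (assumed toy data)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(1)
H, eta = 4, 0.5                                  # hidden units, learning rate
W1, b1 = rng.normal(0, 0.5, (2, H)), np.zeros(H)
W2, b2 = rng.normal(0, 0.5, (H, 1)), np.zeros(1)

err_before = np.sum((T - forward(X, W1, b1, W2, b2)) ** 2)

for epoch in range(5000):                        # begin epoch
    for x, t in zip(X, T):                       # for each training example
        h = sigmoid(x @ W1 + b1)                 # forward pass
        o = sigmoid(h @ W2 + b2)
        delta_o = o * (1 - o) * (t - o)          # output-unit error
        delta_h = h * (1 - h) * (W2 @ delta_o)   # backpropagated hidden error
        W2 += eta * np.outer(h, delta_o); b2 += eta * delta_o
        W1 += eta * np.outer(x, delta_h); b1 += eta * delta_h

err_after = np.sum((T - forward(X, W1, b1, W2, b2)) ** 2)
print(err_before, "->", err_after)               # training error should drop
```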

Comments on Training Algorithm

  • Not guaranteed to converge to zero training error, may converge to local optima or oscillate indefinitely.
  • However, in practice, does converge to low error for many large networks on real data.
  • Many epochs (thousands) may be required, hours or days of training for large networks.
  • To avoid local-minima problems, run several trials starting with different random weights (random restarts).
    • Take results of trial with lowest training set error.
    • Build a committee of results from multiple trials (possibly weighting votes by training set accuracy).

Hidden Unit Representations

  • Trained hidden units can be seen as newly constructed features that make the target concept linearly separable in the transformed space.
  • On many real domains, hidden units can be interpreted as representing meaningful features such as vowel detectors or edge detectors.
  • However, the hidden layer can also become a distributed representation of the input in which each individual unit is not easily interpretable as a meaningful feature.

One forward pass

Activation functions

Non-linearities are needed to learn complex (non-linear) representations of data; otherwise the NN would be just a linear function: \(W_1W_2x = Wx\)

More layers and neurons can approximate more complex functions

Full list: https://en.wikipedia.org/wiki/Activation_function

Activation: Sigmoid

Takes a real-valued number and “squashes” it into the range between 0 and 1, applied element-wise. \[\mathbb{R}^n \rightarrow [0,1]^n\]




  • \(+\) Nice interpretation as the firing rate of a neuron
    • 0 = not firing at all
    • 1 = fully firing
  • \(-\) Sigmoid neurons saturate and kill gradients, thus the NN will barely learn
    • when the neuron’s activations are 0 or 1 (saturated)
      • gradient at these regions is almost zero
      • almost no signal will flow to its weights
      • if initial weights are too large then most neurons will saturate
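The saturation effect follows directly from the derivative \(o(1-o)\); a small sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)     # o(1 - o), maximal 0.25 at z = 0

print(sigmoid_grad(0.0))   # 0.25 at the center
print(sigmoid_grad(10.0))  # tiny: saturated, almost no gradient flows
```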

Activation: Tanh

Takes a real-valued number and “squashes” it into the range between -1 and 1, applied element-wise. \[\mathbb{R}^n \rightarrow [-1,1]^n\]




  • Like sigmoid, tanh neurons saturate
  • Unlike sigmoid, output is zero-centered
  • Tanh is a scaled sigmoid: \(\tanh(x) = 2\,\mathrm{sigm}(2x) - 1\)
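The scaling identity can be checked numerically (a quick sketch):

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 101)
diff = np.max(np.abs(np.tanh(z) - (2 * sigm(2 * z) - 1)))
print(diff)  # numerically ~0
```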

Activation: ReLU

Most Deep Networks use ReLU nowadays

Takes a real-valued number and thresholds it at zero: \(f(x) = \max(0,x)\) \[\mathbb{R}^n \rightarrow \mathbb{R}^n_+\]


  • \(+\) Trains much faster
    • accelerates the convergence of SGD
    • due to linear, non-saturating form
  • \(+\) Less expensive operations
    • compared to sigmoid/tanh (exponentials etc.)
    • implemented by simply thresholding a matrix at zero
  • \(+\) More expressive
  • \(+\) Helps avoid the vanishing gradient problem
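A minimal sketch of ReLU and its (sub)gradient:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)            # just thresholding at zero

x = np.array([-2.0, -0.5, 0.0, 1.5])
out = relu(x)
grad = (x > 0).astype(float)           # gradient: 1 where x > 0, else 0
print(out, grad)
```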

Overfitting

 

A learned hypothesis may fit the training data very well, even outliers (noise), but fail to generalize to new examples (test data)

Overfitting Prevention

  • Running too many epochs can result in over-fitting.

  • Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error.
  • To avoid losing training data for validation:
    • Use internal 10-fold CV on the training set to compute the average number of epochs that maximizes generalization accuracy.
    • Train final network on complete training set for this many epochs.

Regularization

Dropout
  • Randomly drop units (along with their connections) during training
  • Each unit retained with fixed probability \(p\), independent of other units
  • Hyper-parameter \(p\) to be chosen (tuned)

L2 = weight decay
  • Regularization term that penalizes big weights, added to the objective \(J_{reg}(\theta) = J(\theta) + \lambda\sum_k{\theta_k^2}\)
  • Weight decay value determines how dominant regularization is during gradient computation
  • Big weight decay coefficient → big penalty for big weights
Early-stopping
  • Use validation error to decide when to stop training
  • Stop when monitored quantity has not improved after n subsequent epochs
  • \(n\) is called patience
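The patience rule can be sketched with a hypothetical helper and made-up validation errors:

```python
def early_stopping(val_errors, patience):
    """Return the epoch at which training stops: the first epoch after
    which validation error has not improved for `patience` subsequent
    epochs. (Hypothetical helper for illustration.)"""
    best, best_epoch = float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch = err, epoch       # new best: reset the clock
        elif epoch - best_epoch >= patience:
            return epoch                        # patience exhausted
    return len(val_errors) - 1                  # trained to the end

# validation error improves, then plateaus and worsens (made-up values)
errs = [0.9, 0.7, 0.5, 0.45, 0.46, 0.47, 0.48, 0.50]
print(early_stopping(errs, patience=3))  # stops 3 epochs after the best (epoch 3)
```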

Tuning hyper-parameters

Loss functions and output

Determining the Best Number of Hidden Units

  • Too few hidden units prevents the network from adequately fitting the data.
  • Too many hidden units can result in over-fitting.

  • Use internal cross-validation to empirically determine an optimal number of hidden units.

Recurrent Neural Network

Recurrent Neural Networks (RNN)

  • Add feedback loops where some units’ current outputs determine some future network inputs.
  • RNNs can model dynamic finite-state machines, beyond the static combinatorial circuits modeled by feed-forward networks.

Simple Recurrent Network (SRN)

  • Initially developed by Jeff Elman (“Finding structure in time,” 1990).
  • Additional input to hidden layer is the state of the hidden layer in the previous time step.

Unrolled RNN

  • Behavior of RNN is perhaps best viewed by “unrolling” the network over time.

Training RNNs

  • RNNs can be trained using “backpropagation through time.”
  • Can be viewed as applying normal backprop to the unrolled network.

LSTM

Vanishing/Exploding Gradient Problem

  • Backpropagated errors multiply at each layer, resulting in exponential decay (if derivative is small) or growth (if derivative is large).
  • Makes it very difficult to train deep networks, or simple recurrent networks over many time steps.

Long Distance Dependencies

  • It is very difficult to train SRNs to retain information over many time steps.
  • This makes it very difficult to learn SRNs that handle long-distance dependencies, such as subject-verb agreement.

Long Short Term Memory

  • LSTM networks add additional gating units in each memory cell.
    • Forget gate
    • Input gate
    • Output gate
  • Prevents vanishing/exploding gradient problem and allows network to retain state information over longer periods of time.

LSTM Network Architecture

Cell State

  • Maintains a vector \(C_t\) that is the same dimensionality as the hidden state, \(h_t\)
  • Information can be added or deleted from this state vector via the forget and input gates.

Cell State Example

  • Want to remember person & number of a subject noun so that it can be checked to agree with the person & number of verb when it is eventually encountered.
  • Forget gate will remove existing information of a prior subject when a new one is encountered.
  • Input gate “adds” in the information for the new subject

Forget Gate

  • Forget gate computes a 0-1 value, using a logistic sigmoid output function of the input, \(x_t\), and the previous hidden state, \(h_{t-1}\):
  • Multiplicatively combined with cell state, “forgetting” information where the gate outputs something close to 0.

\[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\]

Hyperbolic Tangent Units

  • Tanh can be used as an alternative nonlinear function to the sigmoid logistic (0-1) output function.
  • Used to produce thresholded output between –1 and 1.

Input Gate

  • First, determine which entries in the cell state to update by computing 0-1 sigmoid output.
  • Then determine what amount to add/subtract from these entries by computing a tanh output (valued –1 to 1) function of the input and hidden state.

\[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\] \[\tilde{C}_t = tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\]

Updating the Cell State

  • Cell state is updated by using component-wise vector multiply to “forget” and vector addition to “input” new information.

\[C_t = f_t * C_{t-1} + i_t * \tilde{C}_t\]

Output Gate

  • Hidden state is updated based on a “filtered” version of the cell state, scaled to –1 to 1 using tanh.
  • Output gate computes a sigmoid function of the input and current hidden state to determine which elements of the cell state to “output”.

\[o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)\] \[h_t = o_t * tanh(C_t)\]
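The gate equations above combine into one step function; a minimal numpy sketch with assumed toy dimensions and random weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    """One LSTM step following the equations above; each weight matrix
    acts on the concatenation [h_{t-1}, x_t]."""
    hx = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ hx + bf)            # forget gate
    i_t = sigmoid(Wi @ hx + bi)            # input gate
    C_tilde = np.tanh(Wc @ hx + bc)        # candidate cell values
    C_t = f_t * C_prev + i_t * C_tilde     # update cell state
    o_t = sigmoid(Wo @ hx + bo)            # output gate
    h_t = o_t * np.tanh(C_t)               # new hidden state
    return h_t, C_t

# toy dimensions (assumed): 3-dim input, 2-dim hidden/cell state
rng = np.random.default_rng(0)
d_x, d_h = 3, 2
Wf, Wi, Wc, Wo = [rng.normal(size=(d_h, d_h + d_x)) for _ in range(4)]
bf, bi, bc, bo = [np.zeros(d_h) for _ in range(4)]

h, C = np.zeros(d_h), np.zeros(d_h)
for t in range(5):                          # run over a short random sequence
    h, C = lstm_step(rng.normal(size=d_x), h, C,
                     Wf, bf, Wi, bi, Wc, bc, Wo, bo)
print(h.shape, C.shape)
```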

Overall Network Architecture

  • Single or multilayer networks can compute LSTM inputs from problem inputs and problem outputs from LSTM outputs.

LSTM Training

  • Trainable with backprop derivatives such as:
    • Stochastic gradient descent (randomize order of examples in each epoch) with momentum (bias weight changes to continue in same direction as last update).
    • ADAM optimizer (Kingma & Ba, 2015)
  • Each cell has many parameters (\(W_f\), \(W_i\), \(W_C\), \(W_o\))
    • Generally requires lots of training data.
    • Requires lots of compute time that exploits GPU clusters.

General Problems Solved with LSTMs

  • Sequence labeling
    • Train with supervised output at each time step computed using a single or multilayer network that maps the hidden state (\(h_t\)) to an output vector (\(O_t\)).
  • Language modeling
    • Train to predict next input (\(O_t =I_{t+1}\))
  • Sequence (e.g. text) classification
    • Train a single or multilayer network that maps the final hidden state (\(h_n\)) to an output vector (\(O\)).

Sequence to Sequence
Transduction (Mapping)

  • Encoder/Decoder framework: one LSTM maps the input sequence to a “deep vector”; then another LSTM maps this vector to an output sequence.

  • Train model “end to end” on I/O pairs of sequences.

Summary of
LSTM Application Architectures

Successful Applications of LSTMs

  • Speech recognition: Language and acoustic modeling
  • Sequence labeling
  • Neural syntactic and semantic parsing
  • Image captioning: CNN output vector to sequence
  • Sequence to Sequence
    • Machine Translation (Sutskever, Vinyals, & Le, 2014)
    • Video Captioning (input sequence of CNN frame outputs)

Bi-directional LSTM (Bi-LSTM)

  • Separate LSTMs process sequence forward and backward and hidden layers at each time step are concatenated to form the cell output.

Gated Recurrent Unit (GRU)

  • Alternative RNN to LSTM that uses fewer gates (Cho, et al., 2014)
    • Combines forget and input gates into “update” gate.
    • Eliminates cell state vector

\[z_t = \sigma(W_z \cdot [h_{t-1}, x_t])\] \[r_t = \sigma(W_r \cdot [h_{t-1}, x_t])\] \[\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])\] \[h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t\]
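A minimal numpy sketch of one GRU step following these equations (toy dimensions and random weights are assumptions; biases are omitted as in the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, W):
    """One GRU step: update gate z_t, reset gate r_t, candidate h~_t."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(Wz @ hx)                     # update gate
    r_t = sigmoid(Wr @ hx)                     # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))
    return (1 - z_t) * h_prev + z_t * h_tilde  # interpolate old/new state

rng = np.random.default_rng(0)
d_x, d_h = 3, 2                                # assumed toy dimensions
Wz, Wr, W = [rng.normal(size=(d_h, d_h + d_x)) for _ in range(3)]

h = np.zeros(d_h)
for t in range(5):                             # short random sequence
    h = gru_step(rng.normal(size=d_x), h, Wz, Wr, W)
print(h.shape)
```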

GRU vs. LSTM

  • GRU has significantly fewer parameters and trains faster.
  • Experimental results comparing the two are still inconclusive: on many problems they perform the same, but each works better on some problems.


Attention

  • For many applications, it helps to add “attention” to RNNs.
  • Allows network to learn to attend to different parts of the input at different time steps, shifting its attention to focus on different aspects during its processing.
  • Used in image captioning to focus on different parts of an image when generating different parts of the output sentence.
  • In MT, allows focusing attention on different parts of the source sentence when generating different parts of the translation.

Attention Mechanism

Attention - Scoring

\[score(h_t,\bar{h}_s) = h_t^T\bar{h}_s\]

Compare target and source hidden states


Attention - Normalization

\[a_t(s) = \frac{e^{score(h_t,\bar{h}_s)}}{\sum_{s'}{e^{score(h_t,\bar{h}_{s'})}}}\]

Convert into alignment weights

Attention - Context

\[c_t = \sum_s{a_t(s)\bar{h}_s}\]

Build context vector: weighted average

Attention - Hidden State

\[h_t = f(h_{t-1}, c_t, e_t)\]

Compute next hidden state
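The scoring, normalization, and context steps can be sketched together in a few lines of numpy (all dimensions and values are illustrative assumptions):

```python
import numpy as np

# toy example (assumed): 4 source hidden states, hidden size 3
rng = np.random.default_rng(0)
h_bar = rng.normal(size=(4, 3))   # source hidden states h_bar_s
h_t = rng.normal(size=3)          # current target hidden state

scores = h_bar @ h_t                          # dot-product scoring
a_t = np.exp(scores) / np.exp(scores).sum()   # softmax -> alignment weights
c_t = a_t @ h_bar                             # context: weighted average

print(a_t.sum())                              # weights sum to 1
```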

Summary

Summary: what did we learn?

Time for Practical 6!